Efficient Exploration for Optimizing Immediate Reward
Author
Abstract
We consider the problem of learning an effective behavior strategy from reward. Although much studied, the question of how to use prior knowledge to scale optimal behavior learning up to real-world problems remains an important open issue. We investigate the inherent data-complexity of behavior learning when the goal is simply to optimize immediate reward. Although easier than reinforcement learning, where one must also cope with state dynamics, immediate reward learning is still a common problem and is fundamentally harder than supervised learning. For optimizing immediate reward, prior knowledge can be expressed either as a bias on the space of possible reward models, or a bias on the space of possible controllers. We investigate the two paradigmatic learning approaches of indirect (reward-model) learning and direct-control learning, and show that neither uniformly dominates the other in general. Model-based learning has the advantage of generalizing reward experiences across states and actions, but direct-control learning has the advantage of focusing only on potentially optimal actions and avoiding learning irrelevant world details. Both strategies can be strongly advantageous in different circumstances. We introduce hybrid learning strategies that combine the benefits of both approaches and uniformly improve their learning efficiency.
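To make the two paradigms concrete, here is a minimal sketch (my own, not taken from the paper) of indirect versus direct learning on a toy three-action immediate-reward problem. The reward means, epsilon value, and trial counts are illustrative assumptions.

```python
import random

random.seed(0)

# Hypothetical 3-action immediate-reward problem: true mean rewards
# (unknown to the learner); values chosen purely for illustration.
TRUE_MEANS = [0.2, 0.5, 0.8]

def pull(action):
    """Sample a Bernoulli reward with the action's true mean."""
    return 1.0 if random.random() < TRUE_MEANS[action] else 0.0

def indirect_learner(trials):
    """Indirect (reward-model) learning: fit a mean-reward estimate for
    every action, then act greedily on the learned model.  Generalizes
    experience across actions, but spends trials on clearly bad ones."""
    counts = [0] * len(TRUE_MEANS)
    sums = [0.0] * len(TRUE_MEANS)
    for t in range(trials):
        action = t % len(TRUE_MEANS)   # round-robin sampling to fit the model
        counts[action] += 1
        sums[action] += pull(action)
    means = [s / c for s, c in zip(sums, counts)]
    return max(range(len(means)), key=means.__getitem__)

def direct_learner(trials, eps=0.1):
    """Direct-control learning: epsilon-greedy, concentrating trials on
    the actions that currently look best rather than modeling all of them."""
    counts = [0] * len(TRUE_MEANS)
    means = [0.0] * len(TRUE_MEANS)
    for _ in range(trials):
        if random.random() < eps:
            action = random.randrange(len(TRUE_MEANS))
        else:
            action = max(range(len(means)), key=means.__getitem__)
        counts[action] += 1
        means[action] += (pull(action) - means[action]) / counts[action]
    return max(range(len(means)), key=means.__getitem__)

# With enough trials, both learners should settle on the best action (index 2).
print(indirect_learner(300), direct_learner(300))
```

The sketch also hints at the trade-off the abstract describes: the indirect learner samples every action uniformly to build its model, while the direct learner quickly abandons actions that look suboptimal.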
Similar Resources
Boredom, Information-Seeking and Exploration
Any adaptive organism faces the choice between taking actions with known benefits (exploitation), and sampling new actions to check for other, more valuable opportunities available (exploration). The latter involves information-seeking, a drive so fundamental to learning and long-term reward that it can reasonably be considered, through evolution or development, to have acquired its own value, i...
Optimal Exploration-Exploitation in a Multi-Armed-Bandit Problem with Non-stationary Rewards
In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler’s objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (ex...
Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
In a multi-armed bandit (MAB) problem a gambler needs to choose at each round of play one of K arms, each characterized by an unknown reward distribution. Reward realizations are only observed when an arm is selected, and the gambler’s objective is to maximize his cumulative expected earnings over some given horizon of play T. To do this, the gambler needs to acquire information about arms (ex...
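The non-stationary setting these two abstracts describe can be illustrated with a small sketch (my own, not from either paper): constant-step-size value estimates give recent rewards more weight, letting an epsilon-greedy policy track a reward distribution that changes mid-horizon. The switch point, epsilon, and step size below are illustrative assumptions.

```python
import random

random.seed(1)

HORIZON = 1000

def reward(arm, t):
    """Hypothetical non-stationary 2-armed bandit: the better arm
    switches halfway through the horizon."""
    means = [0.7, 0.3] if t < HORIZON // 2 else [0.3, 0.7]
    return 1.0 if random.random() < means[arm] else 0.0

def tracking_eps_greedy(eps=0.1, alpha=0.1):
    """Epsilon-greedy with constant-step-size value estimates: stale
    observations fade exponentially, so the policy can follow a
    drifting reward distribution instead of averaging over all history."""
    values = [0.0, 0.0]
    total = 0.0
    for t in range(HORIZON):
        if random.random() < eps:
            arm = random.randrange(2)
        else:
            arm = max(range(2), key=values.__getitem__)
        r = reward(arm, t)
        total += r
        values[arm] += alpha * (r - values[arm])  # exponential recency weighting
    return total / HORIZON

print(round(tracking_eps_greedy(), 3))
```

A sample-average learner would keep favoring the initially better arm long after the switch; the constant step size is what allows the policy's estimates to recover.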
Information-Seeking, Learning and the Marginal Value Theorem: A Normative Approach to Adaptive Exploration
Daily life often makes us decide between two goals: maximizing immediate rewards (exploitation) and learning about the environment so as to improve our options for future rewards (exploration). An adaptive organism therefore should place value on information independent of immediate reward, and affective states may signal such value (e.g., curiosity vs. boredom: Hill & Perkins, 1985; Eastwood e...